Abstract

We show that the number of t-ary trees with path length equal to p is
$\exp\left(h(t^{-1})\, t \log t\, \frac{p}{\log p}\,(1+o(1))\right)$, where $h(x) = -x\log x - (1-x)\log(1-x)$ is the

binary entropy function. Besides its intrinsic combinatorial interest, the ques- 
tion recently arose in the context of information theory, where the number of 
t-ary trees with path length p estimates the number of universal types, or,
equivalently, the number of different possible Lempel-Ziv'78 dictionaries for 
sequences of length p over an alphabet of size t. 
1 Introduction

Path length is an important global parameter of a tree that arises in various computational
contexts (cf. [7, Sec. 2.3.4.5]). Although the distribution of path lengths
among trees with a given number of nodes has been studied, the problem of estimat- 
ing their distribution by path length alone has remained open. The question recently 
arose in an information-theoretic context, in connection with the notion of universal 
type [10],[9], based on the incremental parsing of Ziv and Lempel (LZ78) [H]. When
applied to a t-ary sequence, the LZ78 parsing produces a dictionary of strings that is 
best represented by a t-ary tree whose path length corresponds to the length of the 
sequence. Two sequences of the same length are said to be of the same universal type 
if they yield the same t-ary parsing tree. Sequences of the same universal type are, 
in a sense, statistically indistinguishable, as the variational distance between their 
empirical probability distributions of any finite order vanishes in the limit [9],[10].
Universal types generalize the notion underlying the classical method of types, which
has led to important theoretical results in information theory [3]. Of great interest
in this context is the estimation of the number of different types for sequences of 

* Hewlett-Packard Laboratories, Palo Alto, CA 94304, USA. Part of this work was done while the 
author was with the Mathematical Sciences Research Institute (MSRI), Berkeley, California, USA. 
E-mail: gseroussi@ieee.org. 






a given length. For universal types, this translates to the number of different LZ78 
dictionaries for t-ary sequences of a given length, or, equivalently, the number of t-ary 
trees with a given path length, which is the subject of this paper. 

First, we present some definitions and formalize the problem. Fix an integer $t \ge 2$.
A t-ary tree T is defined recursively as either being empty or consisting of a root node
r and the nodes of t disjoint, ordered, t-ary (sub-)trees $T_1, T_2, \ldots, T_t$, any number of
which may be empty (cf. [7, Sec. 2.3.4.5]). When $T_i$ is not empty, we say that there
is an edge from r to the root r' of $T_i$, and that r' is a child of r. The total number
of nodes of T is $n_T = 0$ if T is empty, or $n_T = 1 + \sum_{i=1}^{t} n_{T_i}$ otherwise. A node of T is
called a leaf if it has no children. The depth of a node $v \in T$ is defined as the number
of edges traversed to get from the root r to v. We denote by $D_j^{(T)}$, $j \ge 0$, the number
of nodes at depth j in T. The sequence $\{D_j^{(T)}\}$ is called the profile of T; we consider
only finite trees, so $\{D_j^{(T)}\}$ has finite support. The path length of a non-empty tree
T, denoted by $p_T$, is the sum of the depths of all the nodes in T, namely

$$p_T = \sum_{j \ge 1} j\, D_j^{(T)}.$$


The subscript T in $n_T$ and $p_T$ will be omitted in the sequel when the tree being
discussed is clear from the context. We call a t-ary tree with n nodes a [t,n] tree. A 
[t,n] tree with path length equal to p will be called a [t,n,p] tree, and a t-ary tree 
with path length equal to p and an unspecified number of nodes will be referred to 
as a [t, -,p] tree. 

Let $C_t(n)$ denote the number of [t,n] trees, and $L_t(p)$ the number of [t, -,p] trees.
It is well known [7, p. 589] that

$$C_t(n) = \frac{1}{(t-1)n+1}\binom{tn}{n}, \qquad n \ge 0,\ t \ge 1. \qquad (1)$$

In the binary case (t = 2), these are the well known Catalan numbers that arise in
many combinatorial contexts. The determination of $L_t(p)$, on the other hand, has
remained elusive, even for t = 2. Consider the bivariate generating function B(w,z)
defined so that the coefficient of $w^p z^n$ in B(w,z) counts the number of [2,n,p] trees.
B(w,z) satisfies the functional equation [7, p. 595]

$$z\,B(w,wz)^2 = B(w,z) - 1.$$

However, deriving the generating function, B(w, 1), of the numbers $L_2(p)$ from this
equation appears quite challenging. Nevertheless, the equation and others of similar
structure have been studied in the literature. In particular, the limiting distribution
of the path length for a given number of nodes is related to the area under a Brownian
excursion [11],[13],[12], which is also known as an Airy distribution. This distribution
occurs in many combinatorial problems of theoretical and practical interest (cf. [1]
and references therein). These studies, however, have not yielded explicit asymptotic
estimates for the numbers $L_t(p)$.
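For small parameters, both counting functions can be tabulated directly, which gives a convenient sanity check on the quantities $C_t(n)$ and $L_t(p)$. The Python sketch below is an illustration only and is not part of the paper's argument. It assumes the standard closed form $C_t(n) = \binom{tn}{n}/((t-1)n+1)$, and it obtains $L_t(p)$ by a brute-force recursion over the t subtrees of the root, using two facts: hanging a subtree below the root adds one to the depth of each of its nodes, and a tree with n nodes has path length at least n - 1. The function names are hypothetical.

```python
from collections import defaultdict
from math import comb


def num_trees(t, n):
    """C_t(n): number of t-ary trees with n nodes, comb(t*n, n) / ((t-1)*n + 1)."""
    return comb(t * n, n) // ((t - 1) * n + 1)


def count_by_path_length(t, p_max):
    """Return [L_t(0), ..., L_t(p_max)], counting non-empty t-ary trees."""
    n_max = p_max + 1  # n nodes force a path length of at least n - 1
    # trees[n][p] = number of [t, n, p] trees with p <= p_max; n = 0 is the empty tree
    trees = [{0: 1}]
    for n in range(1, n_max + 1):
        # forests[k][p]: ordered tuples of subtrees chosen so far, with k nodes in
        # total and summed path length p (each subtree measured from its own root)
        forests = [defaultdict(int) for _ in range(n)]
        forests[0][0] = 1
        for _ in range(t):
            nxt = [defaultdict(int) for _ in range(n)]
            for k, by_p in enumerate(forests):
                for p, c in by_p.items():
                    for k2 in range(n - k):
                        for p2, c2 in trees[k2].items():
                            # the final path length is at least p + p2 + (n - 1)
                            if p + p2 + (n - 1) <= p_max:
                                nxt[k + k2][p + p2] += c * c2
            forests = nxt
        # hanging the n - 1 non-root nodes below the root adds 1 to each depth
        trees.append({p + n - 1: c for p, c in forests[n - 1].items()})
    return [sum(tr.get(p, 0) for tr in trees[1:]) for p in range(p_max + 1)]


if __name__ == "__main__":
    print([num_trees(2, n) for n in range(6)])
    print(count_by_path_length(2, 6))
```

The recursion is exponential in spirit and only practical for small p; it is meant for checking values by hand, not for probing the asymptotic regime of the main result.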
Let $h(x) = -x\log x - (1-x)\log(1-x)$ denote the binary entropy function.¹ The
main result of this paper is the following asymptotic estimate of $L_t(p)$.

Theorem 1 Let $\alpha = h(t^{-1})\, t \log t$. Then, $L_t(p) = \exp\left(\frac{\alpha p}{\log p}\,(1+o(1))\right)$.

The theorem is derived by proving matching upper and lower bounds on $\log L_t(p)$.
The proof is presented in Section 2.

We remark that Knessl and Szpankowski [6] have recently applied the WKB
heuristic [1] to obtain an asymptotic expansion of $\log L_2(p)$ using tools of complex
analysis. The heuristic makes certain assumptions on the form of asymptotic expansions,
and is often considered a practically effective albeit non-rigorous method. The
main term in the expansion of [6] is consistent with Theorem 1 for t = 2. The proofs
in this paper, presented in the next section, are based mostly on simple combinatorial
arguments.



2 Proof of the main result 



In the following lemma, we list some elementary properties of t-ary trees that will be
referred to in the proof of Theorem 1. For a discussion of these properties, see [7,
Sec. 2.3.4.5].²



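Lemma 1 (i) Let $\ell$ be a positive integer, and let T be a [t,n,p] tree achieving minimal
path length among all t-ary trees with $\ell$ leaves. Then,

$$n = \ell + \left\lceil \frac{\ell-1}{t-1} \right\rceil. \qquad (2)$$

Define

$$m = \lceil \log_t \ell \rceil, \qquad (3)$$

and

$$\ell_1 = n - \frac{t^m - 1}{t-1}. \qquad (4)$$

Then, the profile of T is given by

$$D_j^{(T)} = \begin{cases} t^j, & 0 \le j \le m-1, \\ \ell_1, & j = m, \\ 0, & j > m. \end{cases} \qquad (5)$$

In particular, all the leaves of T are either at depth m or m − 1.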
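Item (i) can be checked numerically for small cases. The following Python sketch is an illustration under stated assumptions: it takes (2) in the form $n = \ell + \lceil(\ell-1)/(t-1)\rceil$, (3) as $m = \lceil\log_t \ell\rceil$, and (4) as $\ell_1 = n - (t^m-1)/(t-1)$, builds the profile (5), and then recounts the leaves from the profile by assuming that the nodes of each level are packed under as few parents as possible. The function names are hypothetical.

```python
def ceil_div(a, b):
    """Ceiling of a / b for integers a >= 0 and b > 0."""
    return -(-a // b)


def min_tree_profile(t, leaves):
    """Return (n, m, l1, profile) for a minimal-path-length t-ary tree with `leaves` leaves."""
    if t < 2 or leaves < 1:
        raise ValueError("need t >= 2 and at least one leaf")
    n = leaves + ceil_div(leaves - 1, t - 1)       # assumed form of (2)
    m = 0                                          # m = ceil(log_t leaves), computed exactly
    while t ** m < leaves:
        m += 1
    l1 = n - (t ** m - 1) // (t - 1)               # assumed form of (4)
    profile = [t ** j for j in range(m)] + [l1]    # profile (5): full levels, then l1 nodes
    return n, m, l1, profile


def leaves_from_profile(t, profile):
    """Count leaves when each level's nodes sit under as few parents as possible."""
    count = profile[-1]                            # every node on the deepest level is a leaf
    for upper, lower in zip(profile, profile[1:]):
        count += upper - ceil_div(lower, t)        # nodes of `upper` that have no child
    return count


if __name__ == "__main__":
    print(min_tree_profile(2, 5))
    print(leaves_from_profile(2, min_tree_profile(2, 5)[3]))
```

Computing m with an integer loop rather than a floating-point logarithm avoids rounding errors when the number of leaves is an exact power of t.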



1 Unless a base is explicitly specified, exp and log denote, respectively, the exponential and 
logarithm functions with respect to an arbitrary base that remains consistent throughout the paper. 

2 A slight change of terminology is required: nodes of t-ary trees in our terminology correspond 
to internal nodes of extended t-ary trees in [7]. 






(ii) A [t,n,p] tree with minimal path length satisfies

$$p = p_{\min} = \left(n + \frac{1}{t-1}\right)\mu - \frac{t\,(t^{\mu}-1)}{(t-1)^2} = n \log_t n - O(n), \qquad (6)$$

where $\mu = m$ whenever $n \not\equiv 2 \bmod t$, or $\mu = m+1$ otherwise, with m defined
in (3) for the number of leaves, $\ell$, of the tree. In particular, the tree of (i)
satisfies (6) with $\mu = m$.

(iii) The number of nodes of a [t,n,p] tree satisfies

$$n \le \frac{p}{\log_t p - O(\log\log p)} = \frac{p}{\log_t p}\left(1 + O\!\left(\frac{\log\log p}{\log p}\right)\right). \qquad (7)$$

(iv) The maximal path length of a [t,n] tree is achieved by a tree in which each
internal node has exactly one child (hence, there is exactly one leaf in the tree).
The path length of such a tree is

$$p_{\max} = \frac{n(n-1)}{2}. \qquad (8)$$

(v) There is a [t,n,p] tree for each p in the range $p_{\min} \le p \le p_{\max}$.
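The extreme path lengths of items (ii) and (iv) are easy to compute exactly. The sketch below is an illustration, not the paper's derivation: it computes the minimal path length by filling levels from the top, and compares it with the closed form $(n + \frac{1}{t-1})\mu - \frac{t(t^\mu-1)}{(t-1)^2}$ of (6). As a simplifying assumption, $\mu$ is taken here to be the depth of the deepest level of the greedily filled tree, which sidesteps the case distinction on n mod t. The function names are hypothetical.

```python
def p_max(n):
    """Path length of the chain with n nodes, n(n-1)/2 as in (8)."""
    return n * (n - 1) // 2


def p_min_greedy(t, n):
    """Minimal path length of a t-ary tree with n nodes, filling levels top-down."""
    total, depth, remaining = 0, 0, n
    while remaining > 0:
        here = min(t ** depth, remaining)   # level `depth` holds at most t**depth nodes
        total += depth * here
        remaining -= here
        depth += 1
    return total


def p_min_closed(t, n):
    """Closed form of (6) for t >= 2, with mu assumed to be the deepest level's depth."""
    mu = 0
    while (t ** (mu + 1) - 1) // (t - 1) < n:   # capacity of levels 0..mu
        mu += 1
    # (n + 1/(t-1)) * mu - t (t^mu - 1) / (t-1)^2, over the common denominator (t-1)^2
    num = (n * (t - 1) + 1) * mu * (t - 1) - t * (t ** mu - 1)
    return num // ((t - 1) ** 2)


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, p_min_greedy(2, n), p_min_closed(2, n), p_max(n))
```

By item (v), every integer between the two printed extremes is the path length of some [t,n] tree.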

Proof. Items (i), (ii), and (iv) follow immediately from the discussion in [7,
Sec. 2.3.4.5]. For convenience in the proof of Theorem 1, we characterize, in Item
(i), trees with minimal path length for a given number of leaves, while the discussion
in [7] does so for trees with a given number of nodes. The two characterizations
coincide, except for values of n such that $n \equiv 2 \bmod t$, which never occur in (2).
In that case, a tree with n − 1 nodes would have the same number of leaves and a
shorter path length. A tree that has minimal path length for its number of leaves, on
the other hand, always has minimal path length also for its number of nodes (given
in (2)).

Item (iii) follows from (ii) by solving for n in an equation of the form $p = n \log_t n - O(n)$.
Solutions of equations of this form are related to the Lambert W function, a
detailed discussion of which can be found in [2].

To prove the claim of Item (v), consider a [t,n,p] tree T such that $D_j > 1$ for
some integer j. Let $j_1$ be the largest such integer for the tree T. It follows from
these assumptions that T must have nodes u and v at depth $j_1$, such that u is a
leaf, $u \ne v$, and v has at most one child. Thus, we can transform T by deleting u
and adding a child to v, and obtain a [t,n,p+1] tree. Starting with a [t,n,$p_{\min}$]
tree, the transformation can be applied repeatedly to obtain a sequence of trees with
consecutive values of p, as long as the transformed tree has at least two leaves. When
this condition ceases to hold, we have the tree of Item (iv), which has path length
$p_{\max}$. □






We will also rely on the following estimate of $C_t(n)$ derived from (1) using Stirling's
approximation (see, e.g., [8, Ch. 10]). For positive real numbers $c_1$ and $c_2$, which
depend on t but not on n, we have

$$c_1\, n^{-3/2} \exp\left(h(t^{-1})\,t\,n\right) \le C_t(n) \le c_2\, n^{-3/2} \exp\left(h(t^{-1})\,t\,n\right). \qquad (9)$$

Proof of Theorem 1. We recall that $\alpha = h(t^{-1})\, t \log t$.

(a) Upper bound: $\log L_t(p) \le \frac{\alpha p}{\log p}\,(1 + o(1))$.

Let $n_p$ denote the maximum number of nodes of any tree with path length equal
to p. Clearly, we have

$$L_t(p) \le \sum_{n=1}^{n_p} C_t(n) \le n_p\, C_t(n_p),$$

and thus, by (9), we obtain

$$\log L_t(p) \le \log n_p + \log C_t(n_p) \le h(t^{-1})\,t\,n_p - \frac{1}{2}\log n_p + O(1) = \frac{\alpha}{\log t}\, n_p - \frac{1}{2}\log n_p + O(1). \qquad (10)$$

The claimed upper bound on $\log L_t(p)$ follows from (10) by applying Lemma 1(iii)
with $n = n_p$. The asymptotic error term o(1) in the upper bound is, by (7), of the
form $O(\log\log p/\log p)$.
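The counting step of the upper bound can be evaluated exactly for small p. The Python sketch below is an illustration under stated assumptions: it takes $n_p$ to be the largest n whose minimal path length does not exceed p (which is at least the true maximum number of nodes, so the bound remains valid), uses the standard closed form for $C_t(n)$, and returns $n_p\,C_t(n_p)$. The function names are hypothetical.

```python
from math import comb


def num_trees(t, n):
    """C_t(n) = comb(t*n, n) / ((t-1)*n + 1)."""
    return comb(t * n, n) // ((t - 1) * n + 1)


def p_min(t, n):
    """Minimal path length of a t-ary tree with n nodes (levels filled top-down)."""
    total, depth, remaining = 0, 0, n
    while remaining > 0:
        here = min(t ** depth, remaining)
        total += depth * here
        remaining -= here
        depth += 1
    return total


def max_nodes(t, p):
    """Largest n with p_min(t, n) <= p; p_min is increasing in n, so a linear scan works."""
    n = 1
    while p_min(t, n + 1) <= p:
        n += 1
    return n


def upper_bound(t, p):
    """n_p * C_t(n_p), an upper bound on the number of t-ary trees with path length p."""
    n_p = max_nodes(t, p)
    return n_p * num_trees(t, n_p)


if __name__ == "__main__":
    for p in range(7):
        print(p, max_nodes(2, p), upper_bound(2, p))
```

For small p the bound is loose; the theorem concerns only the leading term of its logarithm as p grows.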

(b) Lower bound: $\log L_t(p) \ge \frac{\alpha p}{\log p}\,(1 + o(1))$.

We prove the lower bound by constructing a sufficiently large class of [t, -,p] trees.

Let $\ell$ be a positive integer, $\ell \ge 2$. We start with a t-ary tree T with $\ell$ leaves and
shortest possible path length, as characterized in Lemma 1(i). Let q be the integer
satisfying

$$C_t(q-1) < \ell - 1 \le C_t(q), \qquad (11)$$

and let $\tau_1, \tau_2, \ldots, \tau_{\ell-1}$ be the first $\ell-1$ distinct [t,q] trees, when [t,q] trees are arranged
in non-decreasing order of path length. Additionally, let $\tau_F$ be a tree with $\beta_F q$ nodes,
for some positive constant $\beta_F$ to be specified later. Finally, let $\pi$ be a permutation on
$\{1, 2, \ldots, \ell-1\}$. We construct a tree $T_\pi$ by attaching the trees $\tau_1, \tau_2, \ldots, \tau_{\ell-1}$ and $\tau_F$
to the leaves of T, so that the i-th leaf (taken in some fixed order) becomes the root
of a copy of $\tau_{\pi(i)}$, $1 \le i < \ell$. The tree $\tau_F$, in turn, is attached to the last leaf of T,
which is assumed to be at (the maximal) depth m. The construction is illustrated in
Figure 1.
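The two quantities that drive the construction, the subtree size q and the number of permutations, are simple to compute. The sketch below is an illustration only. It assumes that q is the smallest integer with $C_t(q) \ge \ell - 1$ (which matches a condition of the form $C_t(q-1) < \ell-1 \le C_t(q)$ when $\ell \ge 3$), and it uses the standard closed form for $C_t(n)$. The function names are hypothetical.

```python
from math import comb, lgamma, log


def num_trees(t, n):
    """C_t(n) = comb(t*n, n) / ((t-1)*n + 1)."""
    return comb(t * n, n) // ((t - 1) * n + 1)


def subtree_size(t, leaves):
    """Smallest q >= 1 such that there are at least leaves - 1 distinct [t, q] trees."""
    if leaves < 2:
        raise ValueError("the construction needs at least two leaves")
    q = 1
    while num_trees(t, q) < leaves - 1:
        q += 1
    return q


def log_num_permutations(leaves):
    """Natural logarithm of (leaves - 1)!, the number of permutations of the attached trees."""
    return lgamma(leaves)


if __name__ == "__main__":
    for leaves in (10, 100, 1000, 10000):
        print(leaves, subtree_size(2, leaves), log_num_permutations(leaves))
```

Because $C_t(q)$ grows exponentially in q, the subtree size grows only logarithmically in the number of leaves, while the logarithm of the number of permutations grows like $\ell \log \ell$.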

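Next, we compute the path length, p, of $T_\pi$. By Lemma 1(i), all the leaves of T are
either at depth $m = \lceil \log_t \ell \rceil$ or at depth m − 1. Assume $\tau_i$, $1 \le i \le \ell-1$, is attached
to a leaf of depth $m - 1 + \epsilon_i$, $\epsilon_i \in \{0, 1\}$, of T. Also, let $\nu_i$ denote the path length of
$\tau_i$, $1 \le i \le \ell-1$, and $\nu_F$ the path length of $\tau_F$. The contribution of $\tau_i$ (excluding its
root) to p is

$$\sum_{j \ge 1} (m - 1 + \epsilon_i + j)\, D_j^{(\tau_i)} = (m - 1 + \epsilon_i) \sum_{j \ge 1} D_j^{(\tau_i)} + \sum_{j \ge 1} j\, D_j^{(\tau_i)} = (m - 1 + \epsilon_i)(q - 1) + \nu_i.$$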
Figure 1: Tree $T_\pi$

Similarly, the contribution of $\tau_F$ to p is

$$p_F = m(\beta_F q - 1) + \nu_F.$$

Considering also the contribution of T according to its profile (5), we obtain

$$p = \sum_{i=1}^{\ell-1} (m - 1 + \epsilon_i)(q - 1) + \sum_{i=1}^{\ell-1} \nu_i + m(\beta_F q - 1) + \nu_F + \sum_{j=1}^{m-1} j\,t^j + \ell_1 m. \qquad (12)$$

Further, observing that $\sum_{i=1}^{\ell-1} \epsilon_i = \ell_1$, and defining $V = (\ell-1)^{-1} \sum_{i=1}^{\ell-1} \nu_i$, we obtain

$$p = \big((\ell-1)(m-1) + \ell_1\big)(q-1) + (\ell-1)V + m(\beta_F q - 1) + \nu_F + \sum_{j=1}^{m-1} j\,t^j + \ell_1 m. \qquad (13)$$

Recall that the trees $\tau_i$ were selected preferring shorter path lengths, so their average
path length V is at most as large as the average path length of all [t,q] trees. The
latter average is known to be $O(q^{3/2})$ (this follows from the results of [5]; see also [7,
Sec. 2.3.4.5] for t = 2). Observe also that, from the definition of q in (11), using (9)
and (3), and recalling that $\alpha = h(t^{-1})\,t\log t$, we obtain

$$q = \frac{\log^2 t}{\alpha}\, m + O(\log m). \qquad (14)$$

Recalling now that $n_{\tau_F} = \beta_F q$, and, hence, $\nu_F = O(q^2)$, it follows, after standard
algebraic manipulations, that (13) can be rewritten as

$$p = \frac{\log^2 t}{\alpha}\, m^2 \ell + O(m^{3/2} \ell). \qquad (15)$$



It also follows from (13) that p is independent of the choice of permutation $\pi$. Moreover,
by construction, each permutation $\pi$ defines a different tree $T_\pi$, and, therefore,
we have

$$L_t(p) \ge (\ell - 1)!. \qquad (16)$$

From (16), using Stirling's approximation, applying (3) and (15), and simplifying, we
can write

$$\frac{\log L_t(p)}{p} \ge \frac{\log (\ell-1)!}{p} = \frac{\ell \log \ell - O(\ell)}{p} = \frac{\ell\, m \log t - O(\ell)}{\alpha^{-1} (\log^2 t)\, m^2 \ell + O(m^{3/2} \ell)} = \frac{\alpha\,(1 - O(m^{-1}))}{m \log t + O(m^{1/2})}. \qquad (17)$$

Taking logarithms on both sides of (15), and applying (3), we can write $m \log t = \log p - O(\log m)$.
Substituting for $m \log t$ in (17), and simplifying asymptotic expressions,
we obtain

$$\frac{\log L_t(p)}{p} \ge \frac{\alpha}{\log p}\,(1 - o(1)), \qquad (18)$$

from which the desired lower bound follows. The o(1) term in (18) is $O(m^{-1/2}) = O((\log p)^{-1/2})$.

The above construction yields large classes of trees of path length p for a sparse
sequence of values of p, controlled by the parameter $\ell$. Next, we show how the gaps
in the sparse sequence can be filled, yielding constructions, and validating the lower
bound, for all (sufficiently large) integer values of p. In the following discussion, when
we wish to emphasize the dependency of m, $\ell_1$, q, and p on $\ell$, we will use the notations
$m(\ell)$, $\ell_1(\ell)$, $q(\ell)$, and $p(\ell)$, respectively. Also, for any such function $f(\ell)$, we denote
by $\Delta f$ the difference $f(\ell+1) - f(\ell)$, with the value of $\ell$ being implied by the context.
We start by estimating $\Delta p$.

Assume first that $\ell$ is such that $\Delta q = 0$ and $\Delta m = 0$. Then, substituting $\ell + 1$
for $\ell$ in (13), and subtracting the original equation, we obtain

$$\Delta p = (m - 1 + \Delta\ell_1)(q - 1) + \nu_\ell + \Delta\ell_1\, m. \qquad (19)$$

It follows from (4) that, with m fixed, we have $0 \le \Delta\ell_1 \le 2$. Also, by (8), we have
$\nu_\ell \le \frac{1}{2} q^2$. Hence, recalling (14), it follows from (19) that

$$\Delta p \le \left(\frac{\alpha}{\log^2 t} + \frac{1}{2}\right) q^2 + O(q \log q). \qquad (20)$$

Notice that, in (13), with all other parameters of the construction staying fixed,
any increment in $\nu_F$ produces an identical change in p. By Lemma 1(v), by an
appropriate evolution of $\tau_F$, we can make $\nu_F$ assume any value in the range
$(\nu_F)_{\min} \le \nu_F \le (\nu_F)_{\max}$, where $(\nu_F)_{\min} = O(\beta_F q \log q)$, and $(\nu_F)_{\max} = \frac{1}{2}\beta_F q(\beta_F q - 1)$. Choosing
$\beta_F > \sqrt{2\alpha(\log t)^{-2} + 1}$, this range of $\nu_F$ will make p span the gap between $p(\ell)$ and
$p(\ell + 1)$ as estimated in (20), for all sufficiently large $\ell$ satisfying the conditions of
this case. Still, the variation in the value of p is asymptotically negligible and does
not affect the validity of (18).

Figure 2: Bridging the gap in q-breaks
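The threshold on $\beta_F$ can be checked in one line. The display below is a sketch of that comparison: it only restates the bound (20) on $\Delta p$ and the range of $\nu_F$ quoted above, and adds no further assumption.

```latex
\[
(\nu_F)_{\max} - (\nu_F)_{\min}
  = \tfrac{1}{2}\,\beta_F^2\, q^2 - O(q \log q)
  \;\ge\; \Bigl(\frac{\alpha}{\log^2 t} + \tfrac{1}{2}\Bigr) q^2 + O(q \log q)
  \;\ge\; \Delta p
\]
```

The middle inequality holds for all sufficiently large q exactly when $\frac{1}{2}\beta_F^2 > \frac{\alpha}{\log^2 t} + \frac{1}{2}$, that is, when $\beta_F > \sqrt{2\alpha(\log t)^{-2} + 1}$, since the $q^2$ terms then dominate the $O(q \log q)$ terms.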

If $\Delta m = 1$, we must have $\ell = \ell_1(\ell) = t^m$, and $\ell_1(\ell+1) = 2$. In this case, using (13)
again, we obtain

$$\begin{aligned}
\Delta p &= (\ell m + 2)(q-1) + \beta_F q - 1 + \nu_\ell + m\ell + 2(m+1) \\
&\quad - \big((\ell-1)(m-1) + \ell\big)(q-1) - \ell m \\
&= (m+1)(q+1) + \nu_\ell + \beta_F q - 1,
\end{aligned}$$

which admits the same asymptotic upper bound as $\Delta p$ in (20). Thus, the gap between
$p(\ell)$ and $p(\ell+1)$ is filled also in this case by tuning the structure of $\tau_F$.

The above method cannot be applied directly when $\Delta q = 1$. We call a value of
$\ell$ such that $q(\ell+1) = q(\ell) + 1$ a q-break. At a q-break, $\Delta p$ is exponential in q, and
a tree $\tau_F$ of polynomial size cannot compensate for such a gap. However, we observe
that the construction of $T_\pi$, and its asymptotic analysis in (13)-(18) would also be
valid if we chose $q' = q + 1$, instead of q, as the size of the trees $\tau_i$. This choice would
produce a different sequence of path length values $p'(\ell)$, which, when substituted for
p, would also satisfy (18) and would validate the lower bound of the theorem. It
follows from (13) that $p'(\ell) > p(\ell)$. Equivalently, for any given (sufficiently large)
value $\ell$, there exists an integer $\ell' < \ell$ such that $p'(\ell') \le p(\ell) < p'(\ell'+1)$.

Consider a q-break $\bar{\ell}$. To construct large classes of trees for all values of p, proceed
as follows (refer to Figure 2): use the original sequence of values $p(\ell)$, filling the
gaps as described above, until $\ell = \bar{\ell}$. At that point, find the largest integer $\ell'$ such
that $p'(\ell') \le p(\bar{\ell})$, and "backtrack" to $\ell = \ell'$. Continue with the sequence $p'(\ell)$,
$\ell = \ell', \ell'+1, \ldots$, filling the gaps accordingly. Notice that $q'(\ell)$ to the left of $\bar{\ell}$ is the
same as $q(\ell)$ to the right of that point. Thus, $p'(\ell)$ continues "smoothly" (i.e., with
gaps $\Delta p$ as in (20)) into $p(\ell)$ at $\ell = \bar{\ell}$. The process now rejoins the sequence $p(\ell)$
as before, until the next q-break point. By (13), since the function $m(\ell)$ remains the
same for both p and p', we have, asymptotically,

$$\ell' \approx (1 - 1/q)\,\bar{\ell} \approx \bar{\ell} - c_3\,\bar{\ell}/\log\bar{\ell},$$

for some positive constant $c_3$. Thus, for sufficiently large $\bar{\ell}$, although the difference
between $\ell'$ and $\bar{\ell}$ is negligible with respect to $\bar{\ell}$, $\ell'$ is guaranteed to fall properly
between q-breaks, and the number of sequence points $p'(\ell)$ used between $\ell'$ and $\bar{\ell}$ is
unbounded. □